Classification

Advanced Analytics with R (UG 21-24)

Ayush Patel

Before we start

Please load the following packages

library(tidyverse)
library(MASS)
library(ISLR)
library(ISLR2)



Access the lecture slides from bit.ly/aar-ug

Warrior's armor (gusoku)
Source: Armor (Gusoku)

Hello

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at the Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objective

Dip our toes into classification techniques: how to apply these methods and how to assess them.

References for this lecture:

  • Chapter 4, ISLR (reference)
  • Chapter 9, Introduction to Modern Statistics (reading for intuitive understanding)
  • Chapter 10.2, Modern Data Science with R

What is Classification?

  • Predict a qualitative response.
  • Approaches for predicting a qualitative response make up a process called classification.
  • A method or technique used for this is referred to as a classifier.
  • We will look into: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes and K-nearest neighbours.

What actually happens

…often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.

Why not use linear regression?

  • Nominal categorical variables have no rank, so how do we assign them quantitative values?
  • Distances between ordinal variable values are not easy to assign.
  • We could make something work when the response is nominal with only two levels.
  • Even then, there is no guarantee that our estimates will lie within [0, 1], which makes interpreting them as probabilities difficult.

Default data

default  student    balance     income
No       No         729.5265   44361.625
No       Yes        817.1804   12106.135
No       No        1073.5492   31767.139
No       No         529.2506   35704.494
No       No         785.6559   38463.496
No       Yes        919.5885    7491.559
No       No         825.5133   24905.227
No       Yes        808.6675   17600.451
No       No        1161.0579   37468.529
No       No           0.0000   29275.268
No       Yes          0.0000   21871.073
No       Yes       1220.5838   13268.562
No       No         237.0451   28251.695
No       No         606.7423   44994.556
No       No        1112.9684   23810.174
No       No         286.2326   45042.413
No       No           0.0000   50265.312
No       Yes        527.5402   17636.540
No       No         485.9369   61566.106
No       No        1095.0727   26464.631
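These rows appear to be the first 20 observations of the Default data set that ships with ISLR2 (and ISLR); assuming that package is installed, they can be reproduced with:

```r
library(ISLR2)     # provides the Default data set

head(Default, 20)  # the 20 rows shown above
dim(Default)       # 10,000 observations of 4 variables
```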

Logistic Regression

  • Logistic regression is well suited for qualitative binary responses.
  • The default variable from Default is our response (\(Y\)).
  • It has two levels: Yes or No.
  • We model the probability that \(Y\) belongs to a particular category.
  • \(Pr(default = Yes|balance)\) is what the logistic model estimates; it is also referred to as \(p(balance)\).
  • A threshold \(a\), with \(0 \le a \le 1\), is chosen depending on risk-aversion behaviour; we predict default when \(p(balance) > a\).
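A minimal sketch of this thresholding step, assuming the Default data from ISLR2 and an illustrative cutoff of a = 0.5:

```r
library(ISLR2)

# Fit Pr(default = Yes | balance); glm() models the second factor
# level ("Yes") as the event of interest
fit <- glm(default ~ balance, data = Default, family = binomial)

p_hat <- predict(fit, type = "response")  # estimated p(balance)

a <- 0.5                                  # risk-aversion threshold
pred <- ifelse(p_hat > a, "Yes", "No")    # classify each observation

table(predicted = pred, observed = Default$default)
```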

But what if?

I ran this: \(p(balance) = \beta_0 + \beta_1X\)

## make a dummy for default

Default |>
  mutate(
    default_dumm = ifelse(
      default == "Yes",
      1, 0
    )
  ) -> def_dum

## regress dummy over balance and plot 

lm(default_dumm ~ balance,
   data = def_dum) |>
  broom::augment() |>
  ggplot(aes(balance, default_dumm)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(balance, .fitted),
            colour = "red") +
  labs(
    title = "Linear regression fit to qualitative response",
    subtitle = "Yes = 1, No = 0",
    y = "prob default status"
  ) +
  theme_minimal() -> plot_linear

## Run the logistic regression

glm(
  default_dumm ~ balance,
  data = def_dum,
  family = binomial
) |>
  broom::augment(type.predict = "response") |>
  ggplot(aes(balance, default_dumm)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(balance, .fitted),
            colour = "red") +
  labs(
    title = "Logistic regression fit to qualitative response",
    subtitle = "Yes = 1, No = 0",
    y = "prob default status"
  ) +
  theme_minimal() -> logistic_plot

Logistic Model

We saw that some fitted values in the linear model were negative.

We need a function that always returns values in [0, 1].

\[p(X) = \frac{e^{(\beta_0 + \beta_1X)}}{1+e^{\beta_0 + \beta_1X}}\]

This is the logistic function. Its coefficients are estimated by the method of maximum likelihood.

odds:

\[\frac{p(X)}{1-p(X)}\]

log odds or logit:

\[log(\frac{p(X)}{1-p(X)}) = \beta_0 + \beta_1X\]
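A quick numerical check of these identities in base R (the values are illustrative, not taken from the Default model):

```r
p <- 0.2                 # an arbitrary probability
odds <- p / (1 - p)      # 0.2 / 0.8 = 0.25
logit <- log(odds)       # the log odds

# Inverting the logit via the logistic function recovers p
p_back <- exp(logit) / (1 + exp(logit))
all.equal(p_back, p)     # TRUE
```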

Exercise - concept

If the following are the results of the model \(logit(p(default)) = \beta_0 + \beta_1Balance\):

term         estimate       std.error     statistic  p.value
(Intercept)  -10.651330614  0.3611573721  -29.49221  3.623124e-191
balance        0.005498917  0.0002203702   24.95309  1.976602e-137

What is the probability of default with a balance of $5,000?
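One way to work this out is to plug the estimates above into the logistic function:

```r
b0 <- -10.651330614   # intercept estimate from the table
b1 <-   0.005498917   # balance coefficient

x <- 5000
eta <- b0 + b1 * x                # the linear predictor (logit)
p   <- exp(eta) / (1 + exp(eta))  # equivalently plogis(eta)
p                                 # greater than 0.9999: default is all but certain
```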

Multiple Logistic Regression

\[p(X) = \frac{e^{(\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_nX_n)}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_nX_n}}\]

Model with income, balance and student:

term         estimate       std.error     statistic   p.value
(Intercept)  -1.086905e+01  4.922555e-01  -22.080088  4.911280e-108
income        3.033450e-06  8.202615e-06    0.369815  7.115203e-01
balance       5.736505e-03  2.318945e-04   24.737563  4.219578e-135
studentYes   -6.467758e-01  2.362525e-01   -2.737646  6.188063e-03

Model with student only:

term         estimate    std.error   statistic   p.value
(Intercept)  -3.5041278  0.07071301  -49.554219  0.0000000000
studentYes    0.4048871  0.11501883    3.520181  0.0004312529
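Coefficient tables like the two above can be produced with broom::tidy(); a sketch, assuming the full specification and a student-only specification fitted to the Default data:

```r
library(ISLR2)
library(broom)

# All three predictors
glm(default ~ income + balance + student,
    data = Default, family = binomial) |>
  tidy()

# student alone -- note the studentYes coefficient flips sign
# relative to the multiple regression, as in the tables above
glm(default ~ student, data = Default, family = binomial) |>
  tidy()
```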

How to know if it's good?

There is no consensus in the statistics community on a single measure that describes goodness of fit for a logistic regression.

glm(
  default_dumm ~ income + balance + student,
  data = def_dum,
  family = binomial
) -> mod_logit

DescTools::PseudoR2(mod_logit,
                    which = c("McFadden", "CoxSnell",
                              "Nagelkerke", "Tjur"))
  McFadden   CoxSnell Nagelkerke       Tjur 
 0.4619194  0.1262059  0.4982860  0.3355203 
AIC(mod_logit) # be careful with this
[1] 1579.545